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Abstract 

Normal  mixture  models  are  widely  used  for  statistical  modeling  of  data,  including 
cluster  analysis.  However  maximum  likelihood  estimation  (MLE)  for  normal  mixtures 
using  the  EM  algorithm  may  fail  as  the  result  of  singularities  or  degeneracies.  To  avoid 
this,  we  propose  replacing  the  MLE  by  a  maximum  a  posteriori  (MAP)  estimator,  also 
found  by  the  EM  algorithm.  For  choosing  the  number  of  components  and  the  model 
parameterization,  we  propose  a  modified  version  of  BIC,  where  the  likelihood  is  eval¬ 
uated  at  the  MAP  instead  of  the  MLE.  We  use  a  highly  dispersed  proper  conjugate 
prior,  containing  a  small  fraction  of  one  observation’s  worth  of  information.  The  re¬ 
sulting  method  avoids  degeneracies  and  singularities,  but  when  these  are  not  present 
it  gives  similar  results  to  the  standard  method  using  MLE,  EM  and  BIC. 
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1  Introduction 


Finite  mixture  models  are  an  increasingly  important  tool  in  multivariate  statistics  (e.g. 
McLachlan  and  Basford  1988;  McLachlan  and  Peel  2000).  Approaches  to  density  estimation 
and  clustering  based  on  normal  mixture  models  have  shown  good  performance  in  many 
applications  (McLachlan  and  Peel  2000;  Fraley  and  Raftery  2002).  Despite  this  success, 
there  remain  issues  to  be  overcome.  One  is  that  standard  parameter  estimation  methods 
can  fail  due  to  singularity  for  some  starting  values,  some  models,  and  some  numbers  of 
components.  The  techniques  proposed  in  this  paper  are  largely  able  to  eliminate  these 
difficulties. 

We  propose  replacing  the  MLE  by  a  maximum  a  posteriori  (MAP)  estimator,  for  which 
we  use  the  EM  algorithm.  For  choosing  the  number  of  components  and  the  model  param¬ 
eterization,  we  propose  a  modified  version  of  BIC,  in  which  the  likelihood  is  evaluated  at 
the  MAP  instead  of  the  MLE.  We  use  a  highly  dispersed  proper  conjugate  prior,  containing 
a  small  fraction  of  one  observation’s  worth  of  information.  The  resulting  method  avoids 
degeneracies  and  singularities,  but  when  these  are  not  present  it  gives  similar  results  to  the 
standard  method  that  uses  MLE,  EM  and  BIC.  It  also  has  the  effect  of  smoothing  noisy 
behavior  of  the  BIC,  which  is  often  observed  in  conjunction  with  instability  in  estimation. 

This  paper  is  organized  as  follows.  In  Section  2,  we  give  a  brief  overview  of  model-based 
clustering,  which  can  also  be  viewed  as  a  proceedure  for  normal  mixture  estimation  that 
includes  model  selection,  both  in  terms  of  component  structure  and  number  of  components. 
In  Section  3,  we  describe  our  Bayesian  regularization  method  and  we  discuss  selection  of  prior 
hyperparameters  appropriate  for  clustering.  In  Section  4,  we  do  the  same  for  multivariate 
normal  mixtures.  In  Section  5,  we  give  examples  of  mixture  estimation  with  these  priors 
for  real  data.  Other  topics  treated  in  this  section  are  alternative  priors,  extension  to  other 
parameterizations  of  multivariate  normal  mixtures,  and  the  types  of  failures  that  can  still 
occur  when  a  prior  is  used  and  how  they  can  be  overcome.  In  Section  6  we  discuss  our  results 
in  the  context  of  other  research  and  further  application  of  this  approach. 

2  Methods 

2.1  Model-Based  Clustering 

In  model-based  clustering,  the  data  y  =  (y\, . . .  ,yn)  are  assumed  to  be  generated  by  a 
mixture  model  with  density 

n  G 

f(y)  =  II Tk  fk{yi  \  0k),  (i) 

i= 1  k= 1 
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where  fk(yi  \  Ok)  is  a  probability  distribution  with  parameters  Ok,  and  rk  is  the  probability  of 
belonging  to  the  fctli  component.  Most  often  (and  throughout  this  paper),  the  /fc  are  taken 
to  be  multivariate  normal  distributions,  parameterized  by  their  means  /i k  and  covariances 
E&: 

fk(Vi  I  0k)  =  cj)(yi  I  Hk,  Efc)  =  |27rEfc|~1/2exp  | --(?/*  -  -  /rfe)  j  , 

where  0k  =  (fj,k,  Efc). 

The  parameters  of  the  model  are  usually  estimated  by  maximum  likelihood  using  the 
Expectation-Maximization  (EM)  algorithm  (Dempster  et  ah  1977;  McLachlan  and  Krishnan 
1997).  Each  EM  iteration  consists  of  two  steps,  an  E-step  and  an  M-step.  Given  an  estimate 
of  the  component  means  (ij ,  covariance  matrices  Ej  and  mixing  proportions  the  E-step 
computes  the  conditional  probability  that  object  i  belongs  to  the  kth  component: 

G 

%ik  i\ l-^k ,  Efc)/  )  )  E j)  . 

3= 1 

In  the  M-step,  parameters  are  estimated  from  the  data  given  the  conditional  probabilities  Zik 
(see,  e.g.,  Cclenx  and  Govaert  1995).  The  E-step  and  M-step  are  iterated  until  convergence, 
after  which  an  observation  can  be  assigned  to  the  component  or  cluster  corresponding  to  the 
highest  conditional  or  posterior  probability.  The  results  of  EM  are  highly  dependent  on  the 
initial  values,  and  model-based  hierarchical  clustering  can  be  a  good  source  of  initial  values 
for  datasets  that  are  not  too  large  in  size  (Banheld  and  Raftery  1993;  Dasgupta  and  Raftery 
1998;  Fraley  1998). 

The  covariance  matrices  can  be  either  fixed  to  be  the  same  across  all  mixture  compo¬ 
nents,  or  allowed  to  vary.  In  general,  the  multivariate  normal  density  has  ellipsoidal  con¬ 
tours,  and  the  covariance  matrices  can  also  be  constrained  to  make  the  contours  spherical 
or  axis-aligned.  Other  parameterizations  are  possible  and  have  been  found  to  be  useful; 
regularization  of  these  is  discussed  in  Section  5.5. 

Several  measures  have  been  proposed  for  choosing  the  clustering  model  (parameterization 
and  number  of  clusters);  see,  e.g.,  Chapter  6  of  McLachlan  and  Peel  (2000).  We  use  the 
Bayesian  Information  Criterion  (BIG)  approximation  to  the  Bayes  factor  (Schwarz  1978), 
which  adds  a  penalty  to  the  loglikelihood  based  on  the  number  of  parameters,  and  has 
performed  well  in  a  number  of  applications  (e.g.  Dasgupta  and  Raftery  1998;  Fraley  and 
Raftery  1998,  2002).  The  BIG  has  the  form 

BIC  =  2  loglik^  (y,  0*k)  -  (#  par  arris)  ^  log(n),  (2) 

where  loglik M(y,0l)  is  the  maximized  loglikelihood  for  the  model  and  data,  (#  pararns) M 
is  the  number  of  independent  parameters  to  be  estimated  in  the  model  A4,  and  n  is  the 
number  of  observations  in  the  data. 
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The  following  strategy  for  model  selection  has  been  found  to  be  effective  in  mixture 
estimation  and  clustering: 

-  Specify  a  maximum  number  of  components,  Gmax,  to  consider,  and  a  set  of  candidate 
parameterizations  of  the  Gaussian  model. 

-  Estimate  parameters  via  EM  for  each  parameterization  and  each  number  of  compo¬ 
nents  up  to  G'max.  The  conditional  probabilities  corresponding  to  a  classification  from 
model-based  hierarchical  clustering,  or  the  estimated  parameters  for  a  simpler  (more 
parsimonious)  model,  are  good  choices  for  initial  values. 

-  Compute  BIG  for  the  mixture  likelihood  with  the  optimal  parameters  from  EM  for  up 
to  G'max  clusters. 

-  Select  the  model  (parameterization  /  number  of  components)  for  which  BIG  is  maxi¬ 
mized. 

For  a  review  of  model-based  clustering,  see  Fraley  and  Raftery  (2002).  Efficient  implemen¬ 
tation  for  large  datasets  is  discussed  in  Wehrens  et  al.  (2004)  and  Fraley  et  al.  (2005). 

The  EM  algorithm  can  fail  to  converge,  instead  diverging  to  a  point  of  infinite  likelihood. 
This  is  because  for  many  mixture  models  the  likelihood  is  not  bounded,  and  there  are  paths 
in  parameter  space  along  which  the  likelihood  tends  to  infinity  (Titterington,  Makov  and 
Smith  1985).  For  example,  in  the  univariate  normal  mixture  model  with  component-specific 
variances,  where  (1)  becomes 

n  G 

f(y)  =  II Tk<l>(yi\f*k,  crl),  (3) 

i=  1  k= 1 

the  likelihood  tends  to  infinity  along  any  path  in  parameter  space  along  which  /jfc  — >  and 
a\  — >  0,  for  any  i,  if  r/c  is  bounded  away  from  zero.  While  these  points  of  infinite  likelihood 
could  technically  be  viewed  as  maximum  likelihood  estimates,  they  do  not  possess  the  usual 
good  properties  of  MLEs,  which  do  hold  for  an  internal  local  maximum  of  the  likelihood 
(Redner  and  Walker  1984). 

In  practice,  this  behavior  is  due  to  singularity  in  the  covariance  estimate,  and  arises  most 
often  for  models  in  which  the  covariance  is  allowed  to  vary  between  components,  and  for 
models  with  large  numbers  of  components.  It  is  natural  to  wonder  whether  the  best  model 
might  be  among  the  cases  for  which  a  failure  is  observed,  and  to  seek  to  modify  the  method 
so  as  to  eliminate  convergence  failures. 

We  propose  to  avoid  these  problems  by  replacing  the  MLE  by  the  maximum  a  posteriori 
(MAP)  estimate  from  a  Bayesian  analysis.  We  propose  a  prior  distribution  on  the  parameters 
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that  eliminates  failure  due  to  singularity,  while  having  little  effect  on  stable  results  obtainable 
without  a  prior.  The  Bayesian  predictive  density  for  the  data  is  assumed  to  be  of  the  form 

C,{Y  |  Tfc,  yk,  S^)  ,  /ifc,  Sj.  |  0), 

where  C  is  the  mixture  likelihood: 

n  G 

£(Y  \tjc  IIE  Tk<f>(Vj  |/hfc,£fc) 

j= i  fc=i 

n  G  i  f  1  i 

=  iie  rk  |27rEfc|  2  exp  i  --(%■  -  nk)TY.kl{yj  -  fik)  f  , 

J=1  k= 1  1  Z  J 

and  "P  is  a  prior  distribution  on  the  parameters  rjt,  Hk  and  T,k,  which  includes  other  param¬ 
eters  denoted  by  6.  We  seek  to  find  a  posterior  mode  or  MAP  (maxiumum  a  posteriori) 
estimate  rather  than  a  maximum  likelihood  estimate  for  the  mixture  parameters. 

We  continue  to  use  BIC  for  model  selection,  but  in  a  slightly  modified  form.  We  replace 
the  first  term  on  the  right-hand  side  of  (2),  equal  to  twice  the  maximized  log-likelihood,  by 
twice  the  log-likelihood  evaluated  at  the  MAP  or  posterior  mode. 


3  Bayesian  Regularization  for  Univariate  Normal  Mix¬ 
ture  Models 


For  one-dimensional  data,  we  use  a  normal  prior  on  the  mean  (conditional  on  the  variance): 

H  |  a2  ~  A cr2 / nv) 


oc 


(4) 


and  an  inverse  gamma  prior  on  the  variance: 

a2  ~  inverseGamma(z/p/2,  <^/2) 

_  tz-p+2  f  2  'j 

“  FT  2  “*{-&}• 


(5) 


This  is  called  a  conjugate  prior  for  a  univariate  normal  distribution  because  the  posterior  can 
also  be  expressed  as  the  product  of  a  normal  distribution  and  an  inverse  gamma  distribution. 
The  hyperparameters  yv,  kv,  uv,  and  g2  are  called  the  mean,  shrinkage,  degrees  of  freedom 
and  scale,  respectively,  of  the  prior  distribution.  The  values  of  the  mean  and  variance  at  the 
posterior  mode  for  a  univariate  normal  under  this  prior  are: 

ny  +  Kv/av 

h  =  - ; - 

kv  +  n 
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Table  1:  M-step  Estimators  for  the  Mean  and  Variance  of  Univariate  Mixture  Models  Under 
the  Normal  Inverse  Gamma  Conjugate  Prior.  The  two  rows  for  the  variance  correspond  to  the 
assumptions  of  equal  or  unequal  variance  across  components.  Here  Zjk  is  the  conditional  probability 
that  observation  j  belongs  to  the  hth  component,  n k  =  YYj= 1  zjk  and  yk  =  YYj= 1  zjkUj/Tik- 


Parameter 

Without  Prior 

With  Prior 

fj'k 

Vk 

k^kVk  T  Zip/ip 

Zip  +  Tlk 

a2 

EfcL i  £"=  i  ZjkiVj  -  Vk)2 

4  +  ££=i  hSSfoG/*  -  M2  +  £/=1  zjk{yj  -  yk)2\ 

n 

i /p  -f"  ti  H-  G  -\-  2 

£"=1  zjk(Vj  ~  Vk)2 

4  +  (KpTnfc)  (Vk  +  £jU  ZjkiVj  VkY 

nk 

lY'p  “h  Tlk  +  3 

?2  4  +  g&v  -  M2  +  UUte  -  s)2 

+  n  +  3 

For  derivations,  see  the  appendix. 

For  univariate  mixtures,  it  is  usually  assumed  that  either  all  of  the  components  share  a 
common  variance  cr2,  or  else  that  the  variance  is  allowed  to  vary  freely  among  the  components. 
Constraining  the  variance  to  be  equal  is  a  form  of  regularization,  and  singularities  are  not 
typically  observed  when  this  is  done.  Singularities  often  arise,  however,  when  the  variance 
is  unconstrained. 

We  use  the  normal  inverse  gamma  conjugate  prior  (4)  and  (5),  and  take  the  prior  dis¬ 
tribution  of  the  vector  of  component  proportion  (ti,  . . . ,  Tq)  to  be  uniform  on  the  simplex. 
Then  the  M-step  estimators  are  given  in  Table  1.  The  prior  hyperparameters  (/ip,  kv ,  vv ,  q2 ) 
are  assumed  to  be  the  same  for  all  components.  For  derivations,  see  the  appendix. 

We  make  the  following  choices  for  the  prior  hyperparameters: 

/ip :  the  mean  of  the  data, 

hiv:  .01 

The  posterior  mean  k^k - can  ^g  viewed  as  adding  kv  observations  with  value 

Zip  +  Tlk 

/ip  to  each  group  in  the  data.  The  value  we  used  was  determined  by  experimentation; 
values  close  to  and  bigger  than  1  caused  large  perturbations  in  the  modeling  in  cases 
where  there  were  no  missing  BIC  values  without  the  prior.  The  value  .01  resulted  in 


BIC  curves  that  appeared  to  be  smooth  extensions  of  their  counterparts  without  the 
prior. 

Up\  d  2  =  3 

The  marginal  prior  distribution  of  ^  is  a  Student’s  t  distribution  centered  at  jiv  with 
uv  —  d  + 1  degrees  of  freedom.  The  mean  of  this  distribution  is  jiv  provided  that  uv  >  d, 
and  it  has  a  finite  variance  provided  that  uv  >  d  +  1  (see,  e.g.  Schafer  1997).  We  chose 
the  smallest  integer  value  for  the  degrees  of  freedom  that  gives  a  finite  variance. 

gf:  var(data)  empjrjca]  variance  of  the  data  divided  by  the  square  of  the  number 
of  components.)  The  resulting  prior  mean  of  the  precision  corresponds  to  a  standard 
deviation  is  one  Gth  that  of  all  of  the  data,  where  G  is  the  number  of  components. 
This  is  roughly  equivalent  to  partitioning  the  range  of  the  data  into  G  intervals  of  fairly 
equal  size. 


4  Bayesian  Regularization  for  Multivariate  Normal  Mix¬ 
tures 


For  multivariate  data,  we  use  a  normal  prior  on  the  mean  (conditional  on  the  covariance 
matrix) : 

H  |  E  ~  A/”(/ip,  T./kv) 


oc 


exp 


K-p 

-—trace 


(/i  flp)  S  (/i  flp) 


and  an  inverse  Wishart  prior  on  the  covariance  matrix: 


(6) 


E  ~  inverseWishart  ( ur ,  Av ) 


oc 


E 


tz-p+d+1  ( 

2  exp  j  —  -trace 


(7) 


As  in  the  univariate  case,  the  hyperparameters  /iP,  kv  and  vv  are  called  the  mean ,  shrinkage 
and  degrees  of  freedom  respectively,  of  the  prior  distribution.  The  hyperparameter  Av,  which 
is  a  matrix,  is  called  the  scale  of  the  inverse  Wishart  prior.  This  is  a  conjugate  prior  for  a 
multivariate  normal  distribution  because  the  posterior  can  also  be  expressed  as  the  product 
of  a  normal  distribution  and  an  inverse  Wishart  distribution.  Under  this  prior,  the  posterior 
means  of  the  mean  vector  and  the  covariance  matrix  are: 


A  ng  T  Kpfip 

h  =  - ; - 

kv  +  n 


and 

^  _  K1  +  QggO  (V  -  hv)(y  -  hv)T  +  E"=i {yj  -  y)(yj  -  V)T 

i)p  +  d  +  2 
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The  normal  inverted  Wishart  prior  and  its  conjugacy  to  the  multivariate  normal  are  discussed 
in  e.g.  Gclman  et  al.  (1995)  and  Schafer  (1997).  The  derivations  leading  to  these  results  are 
given  in  the  appendix. 

4.1  Multivariate  Mixture  Models 

For  multivariate  normal  mixtures,  the  contours  of  the  component  densities  are  ellipsoidal, 
and  the  component  covariance  matrices  can  be  constrained  so  that  the  contours  are  spherical 
(proportional  to  the  identity  matrix),  or  axis-aligned  (diagonal).  It  is  usually  assumed  either 
that  all  the  components  share  a  common  covariance  matrix,  or  that  the  covariance  matrix 
can  vary  freely  between  the  components.  As  in  the  univariate  case,  constraining  the  variance 
to  be  equal  is  a  form  of  regularization,  and  failures  in  estimation  are  not  typically  observed 
when  this  is  done  except  when  large  numbers  of  mixture  components  are  involved,  causing 
the  mixing  proportions  of  some  components  to  shrink  to  zero.  Constraining  the  covariance 
matrix  to  be  diagonal  or  spherical  is  also  a  form  of  regularization  for  multivariate  data. 
Although  singularities  may  arise  when  the  covariance  is  restricted  to  one  of  these  forms 
but  otherwise  allowed  to  vary  among  components,  they  occur  less  frequently  than  when  the 
covariance  matrix  is  unconstrained. 

Under  the  conjugate  prior  with  the  inverse  gamma  prior  (5)  on  the  variance  components 
for  the  diagonal  and  spherical  models,  the  inverse  Wishart  prior  (7)  on  the  covariance  for  the 
ellipsoidal  models,  and  the  normal  prior  (6)  on  the  mean,  the  M-step  estimators  are  given 
in  Table  2.  The  prior  hyperparameters  (kv,uv,  Av,q^)  are  assumed  to  be  the  same  for  all 
components.  For  derivations,  see  the  appendix. 

4.2  Multivariate  Prior  Hyperparameters 

We  make  the  following  choices  for  the  prior  hyperparameters  for  mulivariate  mixtures. 

/ip:  the  mean  of  the  data. 

kv:  .01:  the  same  reasoning  applies  as  in  the  univariate  case. 
u-p'.  d  T  2 

Analogously  to  the  univariate  case,  the  marginal  prior  distribution  of  p  is  multivariate 
t  centered  at  /iP  with  uv  —  d  + 1  degrees  of  freedom.  The  mean  of  this  distribution  is  nv 
provided  that  vv  >  d,  and  it  has  a  finite  covariance  matrix  provided  uv  >  d  +  1  (see, 
e.  g.  Schafer  1997).  We  chose  the  smallest  integer  value  for  the  degrees  of  freedom 
that  gives  a  finite  covariance  matrix. 
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Table  2:  M-step  estimators  for  the  mean  and  variance  of  multvariate  mixture  models  under  the 
normal  inverse  gamma  and  normal  inverse  Wishart  conjugate  priors.  The  rows  for  the  variance 
correspond  to  the  assumptions  of  equal  or  unequal  spherical  variance  across  components,  and  equal 
or  unequal  ellipsoidal  variance  across  components.  Here  Zjk  is  the  conditional  probability  that 
observation  j  belongs  to  the  kth  component,  nk  =  Y^j=\zjk->  Ilk  =  Y)j=i  zjkyj/nk,  and  114  = 
Y0j= 1  zjk{Uj  ~  Vk)(yj  —  Vk)T ,  and  e,  is  the  ith  column  of  the  identity  matrix. 


Parameter 

Without  Prior 

With  Prior 

f^k 

Vk 

rikUk  T  n-pdr 

K-p  T  nk 

a2 

Efc=i  trace  (W4) 

+  EfcLi  trace 

[(KV+nk)(yk  dr)(yk  dr)  +  Wk\ 

nd 

z/p  (72  -I-  G)d  “I-  2 

trace  (114) 

+  trace 

(Kv+nk)(yk  dr)(yk  dr)  +Wk\ 

rihd 

z/p  H-  Tikd  d  2 

diag(42) 

diag  (<;2  +  ef  Ek=1 

Hv)(Vk  ^)T  +  Wk\ 

bi 

n 

Z/p  ~\~  71  -\-  2 

diag($fc) 

diag  (114) 

diag  [q2  +  ef 

/■*)&  ^F  +  IU] 

e* 

nk 

v-p  H -  Tik,  “I-  2 

£ 

tLi  w* 

+  Efc=i 

[(Kp+i»fc)(^  dr)(Vk  dr)  +Wk 

n 

z/p  -\-Ti-\-G-\-d-\-\ 

£k 

wk 

h-v  +  (Kp+nfc)  (?/fe  dr)(dk  dr)  +  H  k 

nk 

z/p  +  Ttk.d  +  d  +  2 
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A  sum(diag(jar  (data) ) )/d  (For  spherical  and  diagonal  models.)  The  average  of  the 
diagonal  elements  of  the  empirical  covariance  matrix  of  the  data  divided  by  the  square 
of  the  number  of  components  to  the  1/d  power. 

Av\  Vaig2A^a^  (For  ellipsoidal  models.)  The  empirical  covariance  matrix  of  the  data 
divided  by  the  square  of  the  number  of  components  to  the  1/d  power. 

The  volume  of  the  ellipsoid  defined  by  <//  or  Av  is  one  G2th  of  the  volume  of  the  ellipsoid 
defined  by  the  empirical  covariance  matrix  of  all  of  the  data,  where  G  is  the  number  of 
components. 

5  Examples 

Figure  1  displays  the  symbols  used  in  the  BIC  plots  throughout  this  section  along  with  their 
associated  model  parameterization. 


Spherical/Univariate: 

A  Ell/E  equal  variance 

A  Vll/V  unconstrained 

Diagonal: 

%  EEI  equal  variance 

A  EVI  equal  volume 

A  VEI  equal  shape 

Q  VVI  unconstrained 

Ellipsoidal: 

■  EEE  equal  variance 

A  EEV  equal  volume  &  shape 

A  VEV  equal  shape 

A  VVV  unconstrained 


Figure  1:  Legend  for  BIC  Plots.  Different  symbols  correspond  to  different  model  parameterizations. 
The  three  letter  codes  are  those  used  to  designate  equal  (E)  or  varying  (V)  shape,  volume,  and 
orientation,  respectively,  in  the  MCLUST  software  (Fraley  and  Raftery  1999,  2003).  The  letter  I 
designates  a  spherical  shape  or  an  axis-aligned  orientation. 
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Table  3:  Source  and  Description  for  the  Three  Univariate  Data  Sets  Used  in  the  Examples. 


dataset 

A  observations 

reference 

Acidity 

155 

Crawford  et  al.  1992 

Enzyme 

245 

Bechtel  et  al.  1993 

Galaxy 

82 

Roeder  1990 

5.1  Univariate  Examples 

Figure  2  shows  histograms  and  model-based  clustering  results  for  the  three  univariate  datasets 
analyzed  in  Richardson  and  Green  (1997).  The  data  are  described  in  Table  3.  The  equal- 
variance  model  is  started  with  the  outcome  of  model-based  hierarchical  clustering  for  that 
model,  and  the  unequal  variance  model  is  started  with  the  result  of  the  equal  variance  model1. 
Note  that  without  the  prior,  there  are  no  results  available  for  the  unconstrained  variance 
model  when  the  number  of  components  is  sufficiently  large.  The  reason  is  that  parameters 
could  not  be  estimated  due  to  singularities. 

For  the  acidity  data,  the  standard  BIG  values  based  on  the  MLE  are  not  available  for  the 
unequal  variance  model  with  five  or  more  mixture  components.  In  the  five-component  case, 
the  EM  algorithm  hits  a  path  around  the  250th  iteration  along  which  the  variance  for  the 
first  component  tends  rapidly  to  zero  and  the  likelihood  diverges  to  infinity  (see  Table  4). 
With  the  prior,  BIG  values  are  available  for  all  models  and  numbers  of  groups.  The  results 
are  similar  with  and  without  the  prior  in  cases  where  both  are  available,  and  the  overall 
conclusion  is  unchanged. 

For  the  enzyme  data,  including  the  prior  allows  us  to  assess  solutions  more  than  eight 
components,  but  does  not  otherwise  change  the  analysis. 

For  the  galaxy  data,  BIG  without  the  prior  chooses  six  components  with  unequal  vari¬ 
ances,  by  a  small  margin  over  the  three-components  model,  while  with  the  prior  it  chooses 
three  components  fairly  clearly.  This  dataset  has  been  extensively  analyzed  in  the  literature, 
including  a  number  of  studies  using  mixtures  of  normals. 

Roeder  and  Wasserman  (1997)  chose  three  components  using  an  approach  similar  to  ours 
based  on  MLE  via  EM  and  BIC;  their  choice  of  three  rather  than  six  components  seems 
to  be  due  to  their  using  a  different  (and  lower)  local  mode  of  the  likelihood  for  the  six- 
component  model.  Richardson  and  Green  (1997)  did  a  Bayesian  analysis  using  reversible 
jump  MCMC,  and  reported  a  posterior  distribution  with  both  mode  and  median  at  six 
components,  although  they  indicated  later  that  convergence  may  not  have  been  achieved 
(Richardson  and  Green  1997,  page  789). 

Figure  3  shows  the  classifications  and  the  estimated  densities  for  these  two  competing 
1This  initialization  differs  from  the  default  in  the  MCLUST  software  for  the  unequal  variance  case. 
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histogram 


BIC  (no  prior) 


BIC  (with  prior) 


3  4  5  6  7 

Acidity  Data 


0.0  0.5  1.0  1.5  2.0  2.5  3.0 

Enzyme  Data 


10  15  20  25  30  35 

Galaxy  Data 


Figure  2:  BIC  for  the  Univariate  Acidity,  Enzyme  and  Galaxy  Datasets.  The  histogram  for  the 
dataset  is  given  in  the  left  column,  while  plots  of  the  number  of  components  versus  BIC  values 
are  shown  for  the  model-based  clustering  without  (center)  and  with  (right).  The  equal  variance 
model  is  indicated  by  filled  symbols,  while  the  model  in  which  the  variance  is  allowed  to  vary  across 
components  is  indicated  by  open  symbols. 

models.  They  agree  that  the  seven  smallest  observations  form  a  group,  as  do  the  largest 
three.  They  disagree  about  whether  the  middle  72  observations  can  be  adequately  described 
by  one  normal  density,  or  whether  four  normal  components  are  needed.  The  figure  suggests 
that  several  of  the  groups  in  the  six-group  solution  may  be  spurious. 

One  can  also  shed  some  light  on  this  issue  by  assessing  the  fit  of  a  normal  distribution  to 
the  middle  72  observations.  Figure  4(a)  shows  the  cumulative  distribution  function  (CDF) 
for  the  full  galaxy  dataset,  together  with  the  CDFs  for  99  datasets  of  the  same  size  simulated 
from  the  fitted  normal  distribution.  Thirteen  of  the  82  observations  lie  outside  the  pointwise 
band,  in  line  with  the  well-accepted  conclusion  that  one  normal  does  not  fit  the  entire  dataset. 
Figure  4(b)  shows  the  same  thing  for  the  middle  72  observations;  the  entire  empirical  CDF 
lies  inside  the  band,  suggesting  that  one  normal  distribution  is  adequate  for  this  group. 
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Table  4:  Values  of  erg  from  the  M-step  of  EM  for  the  Acidity  Data  Under  the  Five-Component, 
Unconstrained  Variance  Model  Without  the  Prior.  The  variance  for  one  of  the  components  falls 
below  the  threshold  for  failure  due  to  singularity. 


trg,  k  =  1, ...  ,5 


iteration 

1 

2 

3 

4 

5 

246 

247 

248 

249 

250 


0.046731 

0.056739 

0.065152 

0.072763 

0.079982 

0.22418 

0.171405 

0.108567 

0.038607 

0.000307 


0.018704 

0.025452 

0.031345 

0.036744 

0.041453 

0.049378 

0.049469 

0.049606 

0.049819 

0.050004 


0.071418 

0.085215 

0.092778 

0.097979 

0.102011 


0.182963 

0.183083 

0.183261 

0.183493 

0.183625 


0.058296 

0.070723 

0.077127 

0.080919 

0.08329 

0.063843 

0.063836 

0.063829 

0.063823 

0.063815 


0.024956 

0.030543 

0.033145 

0.034621 

0.035558 

0.044054 

0.044056 

0.044057 

0.044058 

0.04406 


Kolmogorov-Smirnov  test  statistics  for  testing  a  normal  distribution  tell  a  similar  story:  the 
P  value  for  the  full  dataset  is  0.005,  while  that  for  the  middle  72  observations  is  0.254. 
Note  that  these  results  are  based  on  estimated  parameters,  which  is  anti-conservative.  If 
account  were  taken  of  parameter  uncertainty,  the  tests  would  be  less  likely  to  reject  the 
null  hypothesis  of  normality,  and  so  the  conclusion  for  the  middle  72  observations  would  be 
unchanged.  Overall,  this  analysis  provides  support  for  the  three-group  solution. 
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density 


Without  Prior 


With  Prior 


iiiiiiiiiiiim  i 

ii  uiiiiiiiiii 

ii 

Hill  II 

mill  i  iiiiiiiiiiiiiiiiiniiiiiiii  mu  i  i  ill 

10  15  20  25  30  35 


ii  ii  uiiiiiiiiii  uiiiiiiiiii  mi  i  i 


III!  II 


mill  ii  ii  uiiiiiiiiii  uiiiiiiiiii  mi  i  i 

10  15  20  25  30 


Figure  3:  Classifications  (top)  and  Densities  Corresponding  to  the  Mixture  Models  Fitted  to  the 
Univariate  Galaxy  Data  Without  and  With  the  Prior.  In  the  classification  plots,  all  of  the  data  is 
shown  at  the  bottom,  while  the  different  classes  are  separated  on  the  lines  above. 


16  18  20  22  24  26 

Middle  72  observations 


(a)  Full  Galaxy  Dataset  (b)  Middle  Group  of  Size  72 

Figure  4:  The  Empirical  CDF  (red)  Together  with  Empirical  CDFs  for  99  Datasets  of  the 
Same  Size  Simulated  from  the  Fitted  Normal  Distribution  (black)  for  (a)  the  Full  Galaxy 
Dataset,  and  (b)  the  Middle  72  Observations. 
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5.2  Butterfly  Example 


In  this  section  we  consider  observations  from  the  butterfly  dataset  (Celeux  and  Robert, 
1993),  consisting  of  the  measurements  of  the  widths  of  the  upper  and  lower  wings  for  23 
butterflies,  shown  in  Figure  5.  The  original  goal  was  to  ascertain  how  many  species  were 
present  in  the  sample  and  classify  the  butterflies  into  them. 


20  22  24  26  28 


Upper  Wing  Width 


Figure  5: 


Wing  widths  from  the  butterfly  dataset.  There  are  23  observations. 


Figure  6  shows  the  BIC  for  equal  variance  and  unconstrained  variance  models,  assuming 
spherical,  diagonal,  and  elliptical  shapes  (the  models  for  which  priors  were  discussed  in 
Section  4).  For  all  models,  EM  was  started  using  the  results  from  model-based  hierarchical 
clustering  on  the  unconstrained  ellipsoidal  variance  model.  The  model  and  classification 
chosen  according  to  the  BIC  are  the  same  regardless  of  whether  or  not  the  prior  is  imposed, 
but  failures  due  to  singularity  for  the  unconstrained  models  are  eliminated  with  the  prior. 

The  four-component  unconstrained  model  without  the  prior  fails  at  the  outset.  The 
hierarchical  clustering  result  based  on  the  unconstrained  model  used  for  initialization  assigns 
a  single  observation  to  one  of  the  groups  in  this  case.  In  this  case,  the  Bayesian  regularization 
allows  the  identification  of  a  group  with  a  single  member  while  allowing  the  covariance  matrix 
to  vary  between  clusters,  which  is  not  possible  without  the  prior. 
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without  prior 


with  prior 


2  4  6  8 

number  of  components 


Figure  6:  The  top  row  gives  the  BIC  for  the  six  models  for  variables  3  and  4  the  butterfly 
dataset,  while  the  middle  row  shows  details  of  the  maximum  BIC  models.  Refer  to  Figure  1  for 
the  correspondence  between  symbols  and  models.  The  bottom  row  displays  a  projection  of  the 
data  showing  the  classification  corresponding  to  the  maximum  BIC.  Failures  due  to  singularities  no 
longer  occur  when  the  prior  is  included  in  the  model,  although  the  BIC  selects  a  model  mapping 
to  the  same  four  group  classfication  regardless  of  whether  or  not  the  prior  is  imposed. 
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5.3  Trees  Example 


In  this  section  we  analyze  the  trees  dataset  (Ryan  et  al.  1976)  included  in  the  R  lan¬ 
guage  (www.r-project.org).  A  pairs  plot  of  the  data  is  shown  in  Figure  7.  There  are  31 
observations  of  3  variables. 


Figure  7:  Pairs  plot  of  the  trees  dataset.  There  are  31  observations. 

Figure  8  shows  the  BIC  for  the  trees  data  for  the  equal  variance  and  unconstrained 
variance  models,  assuming  spherical,  diagonal,  and  ellipsoidal  shapes.  For  the  equal  variance 
models,  EM  was  started  using  the  results  from  model-based  hierarchical  clustering  assuming 
no  constraints  on  the  variance.  For  the  unconstrained  variance  model,  EM  was  started  using 
the  results  from  the  equal  variance  model. 

Figure  9  shows  the  3  and  5  group  classifications  where  the  BIC  has  peaks  without  the 
prior,  and  the  2  group  classification  corresponding  to  the  maximum  BIC  with  the  prior.  The 
six-component  unconstrained  model  fails  to  converge  without  the  prior  in  this  case  because 
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without  prior 


with  prior 


2468  10  2468  10 

number  of  components  number  of  components 


Figure  8:  The  top  row  gives  the  BIC  for  the  six  models  for  the  trees  dataset,  while  the  bottom 
row  shows  details  of  the  BIC  near  the  maximum.  Refer  to  Figure  1  for  the  correspondence  between 
symbols  and  models.  Two  groups  are  favored  over  five  when  the  prior  is  imposed. 

one  of  the  covariances  becomes  singular  as  the  EM  iterations  progress,  as  shown  in  Figure 

10. 
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without  prior 


with  prior 


Figure  9:  Trees  Data:  A  Projection  of  the  Data  Showing  the  Three  and  Five  Group  Classifications, 
where  the  BIC  has  peaks  without  the  prior,  and  the  2  group  classification  where  the  BIC  peaks 
with  the  prior. 
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without  prior 


with  prior 


after  first  iteration  converged  model 

Figure  10:  For  a  projection  of  the  trees  data,  the  top  row  shows  the  initial  equal-variance  models 
fitting  a  six-component  unconstrained  model  without  and  with  the  prior.  The  bottom  left  figure 
shows  the  six-component  model  after  the  first  iteration  without  the  prior,  showing  the  covariance  for 
component  6  collapsing  to  near  singularity.  At  the  bottom  left  is  the  six-component  unconstrained 
model  converged  fit  with  the  prior. 


23 


Table  5:  Summary  Results  (mean,  standard  deviation  (‘sd’),  nrin,  max)  for  the  BIC  and  Number 
of  Components  for  the  Brain  MRI  Data,  with  and  without  the  prior.  EM  was  initialized  with 
parameters  from  model-based  hierarchical  clustering  over  10  different  initial  random  samples  of 
size  2,371.  Note  that  there  is  much  greater  variability  in  both  the  BIC  and  number  of  components 
without  the  prior. 


number  of  mixture  components 

BIC 

mean  sd  min 

max 

mean 

sd 

min 

max 

without  prior 

29.3  7.4  15 

38 

-756,481 

533 

-757,870 

-756,051 

with  prior 

25.4  2.8  21 

30 

-756,486 

270 

-756,889 

-756,090 

5.4  Brain  MRI 

These  data  describe  a  four-band  magnetic  resonance  image  (MRI)  consisting  of  23,712  pixels 
of  a  brain  with  a  tumor  2.  Because  of  the  size  of  the  dataset,  it  is  not  feasible  to  do  model- 
based  hierarchical  clustering  on  the  entire  dataset  to  obtain  initial  values  for  EM  estimation, 
so  instead  we  apply  hierarchical  clustering  to  a  random  sample  of  the  observations  to  get 
initial  parameter  estimates  for  EM  (e.g.  Wehrens  et  al.  2004;  Fraley  et  al.  2005).  Table  5 
summarizes  the  results  for  the  BIC  and  number  of  clusters  for  the  brain  MRI  initialized  with 
model-based  hierarchical  clustering  over  10  different  initial  random  samples  of  size  2,371. 
The  main  observation  is  that  the  variability  in  both  the  BIC  values  and  the  number  of 
components  is  greatly  reduced  by  imposing  the  prior,  which  seems  to  stabilize  the  results. 

Figure  11  shows  classifications  obtained  with  and  without  the  prior  for  one  of  the  initial 
samples. 

5.5  Other  Parameterizations  and  Priors 

Banfield  and  Raftery  (1993)  expressed  the  covariance  matrix  for  the  k- th  component  of  a 
multivariate  normal  mixture  model  in  the  form 

=  XkDkAkDk ,  (8) 

where  Dk  is  the  matrix  of  eigenvectors  determining  the  orientation,  Ak  is  a  diagonal  matrix 
proportional  to  the  eigenvalues  determining  the  shape,  and  Xk  is  a  scalar  determining  the 
volume  of  the  cluster.  They  used  this  formulation  to  define  a  class  of  hierarchical  clustering 
methods  based  on  cross-cluster  geometry,  in  which  mixture  components  may  share  a  common 
shape,  volume,  and/or  orientation.  This  approach  generalizes  a  number  of  existing  clustering 
methods.  For  example  if  the  clusters  are  restricted  to  be  spherical  and  identical  in  volume, 

2These  data  were  obtained  from  Professor  Ron  Wehrens,  Department  of  Analytical  Chemistry,  Univeristy 
of  Nijmegen,  and  Professor  A.  Heerschap  of  the  Radboud  University  Medical  Center,  Nijmegen,  The  Nether¬ 
lands. 


24 


without  prior 


with  prior 


Figure  11:  An  instance  of  the  MRI  classification  without  and  with  prior  using  the  same  initial 
sample. 

the  clustering  criterion  is  the  same  as  that  which  underlies  Ward’s  method  (Ward  1963)  and 
h- means  clustering  (MacQueen  1967).  Banfield  and  Raftery  (1993)  developed  this  class  of 
models  in  the  context  of  hierarchical  clustering  estimated  using  the  classification  likelihood, 
but  the  same  parameterizations  can  also  be  used  with  the  mixture  likelihood.  A  detailed 
description  of  14  different  models  that  are  possible  under  this  scheme  can  be  found  in  Celeux 
and  Govaert  (1995). 

Many  of  these  models  can  be  estimated  by  the  MCLUST  software  (Fraley  and  Raftery 
1999,  2003),  and  there  they  are  designated  by  a  three- letter  symbol,  where  the  three  letters 
indicate  volume,  shape  and  orientation,  respectively.  The  letter  E  indicates  cross-cluster 
equality,  while  V  denotes  freedom  to  vary  across  clusters,  and  the  letter  I  designates  a 
spherical  shape  or  an  axis-aligned  orientation. 

It  is  possible  to  formulate  priors  that  eliminate  singularity  in  a  manner  similar  to  that 
used  to  derive  the  priors  in  Table  2  for  most  cases  of  the  parameterization  (8).  These 
priors  are  summarized  in  Table  6.  The  cases  for  which  no  prior  is  given  are  those  for  which 
neither  the  volume  and  shape  nor  the  shape  and  orientation  can  be  treated  as  a  unit  across 
components.  For  other  models  that  are  not  covered  in  Table  2  of  Section  4,  the  M-step 
computations  would  need  to  be  derived  in  the  manner  described  in  Celeux  and  Govaert 
(1995)  once  the  prior  terms  have  been  added  to  the  complete  data  loglikelihood. 

We  have  obtained  good  results  with  an  alternative  heuristic  that  can  be  applied  to  all 
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Table  6:  Parameterizations  of  the  Covariance  Matrix  via  Eigenvalue  Decomposition,  with  the 
associated  prior.  A  is  a  scalar  controling  volume,  A  a  diagonal  matrix  controling  shape,  and  D  an 
orthogonal  matrix  corresponding  to  the  eigenvectors  of  the  covariance  matrix  controling  orientation. 
The  subscript  k  indicates  variation  across  components  in  the  mixture  model. 


MCLUST  symbol 

parameterization 

prior 

applied  to 

Eli 

XI 

inverse  gamma 

A 

¥11 

A  fcJ 

inverse  gamma 

xk 

EE  I 

XA 

inverse  gamma 

each  diagonal  element  of  XA 

VEI 

XkA 

EVI 

X  Ak 

VVI 

XkAk 

inverse  gamma 

each  diagonal  element  of  A kAk 

EEE 

xdadt 

inverse  Wishart 

E  =  XDADt 

VEE 

XkDADT 

inverse  gamma 

xk 

inverse  Wishart 

E  =  DADt 

EVE 

XDAkDT 

WE 

XkDAkDT 

inverse  gamma 

each  diagonal  element  of  A kAk 

EEV 

XDkADTk 

inverse  gamma 

each  diagonal  element  of  XA 

VEV 

XkDkADk 

EVV 

XDkAkDk 

inverse  gamma 

X 

inverse  Wishart 

Ejt  =  DkAkD k 

VVV 

X  kDkAkDk 

inverse  Wishart 

Sfc  =  XkDkAkDk 

parameterizations  based  on  the  decomposition  (8):  noting  that  the  computations  involved 
in  the  M-step  without  the  prior  involve  the  weighted  sums  of  squares  and  products  matrix 

n 

wk  =  Y^  zikhjj  -  Vk){yj  -  Vk)T 

3= 1 

(see  Celeux  and  Goavert  1995),  we  replace  Wk  in  the  M-step  formulas  with  their  analogs 
from  Table  2. 

Figures  12  and  13  show  results  for  the  trees  dataset  for  ten  models,  including  four  (VEI, 
EVI,  EEV,  and  VEV)  for  which  we  use  the  heuristic  just  described.  In  Figure  12,  all  models  are 
initialized  with  the  result  from  model-based  hierarchical  clustering  using  the  unconstrained 
model,  so  there  are  some  differences  with  Figure  8  in  those  instances  where  the  models  are 
the  same.  Without  a  prior,  the  model  fragments  the  data  into  several  small,  highly  ellipsoidal 
components  (see  Figure  13).  There  is  one  component  to  which  only  one  observation  would 
be  assigned  according  to  its  highest  conditional  probability.  With  the  prior,  a  model  with 
fewer  components  is  selected. 
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2468  2468 

number  of  components  number  of  components 


Figure  12:  BIC  for  Ten  Models  for  the  Trees  Dataset.  Refer  to  Figure  1  for  the  correspondence  be¬ 
tween  symbols  and  models.  Without  the  prior,  the  BIC  is  maximized  with  the  same  five-component 
unconstrained  model  obtained  in  Section  5.3,  with  the  steep  increase  and  cutoff  in  BIC  suggesting 
near  singularity.  With  the  prior  it  is  maximized  at  at  two-component  model  in  which  the  volume 
and  shape  of  the  covariances  of  the  components  in  the  model  are  the  same,  but  their  orientation 
may  vary.  Compare  with  Figure  8,  which  shows  the  BIC  for  the  restricted  set  of  six  models  for 
which  an  explicit  prior  is  available. 


5.6  Vanishing  Components 

Of  note  in  Figure  12  is  that  the  estimation  for  the  trees  data  fails  for  some  models,  even 
when  the  prior  is  imposed.  For  this  example,  all  failures  without  the  prior  were  due  to 
singularity  in  the  estimated  covariance  matrices,  while  all  failures  with  the  prior  were  due  to 
the  mixing  proportion  of  one  or  more  components  shrinking  to  zero.  In  this  latter  case,  it  is 
still  possible  to  estimate  the  BIC,  although  no  associated  parameter  or  loglikelihood  values 
are  available.  This  is  accomplished  by  adding  the  appropriate  terms  penalizing  the  number 
parameters  of  parameters  (formula  (2)  in  Section  2.1)  to  the  loglikchood  available  for  the 
largest  number  of  components  with  the  same  parameterization  scheme.  Figure  14  plots  the 
BIC  for  mixture  estimation  with  the  prior  for  the  trees  data,  with  the  values  for  models  that 
include  vanishing  components  shown. 


6  Discussion 

We  have  proposed  a  Bayesian  regularization  method  for  avoiding  the  singularities  and  de¬ 
generacies  that  can  arise  in  estimation  for  model-based  clustering  using  the  EM  algorithm. 
The  method  involves  a  dispersed  but  proper  conjugate  prior,  and  uses  the  EM  algorithm  to 
find  the  posterior  mode,  or  MAP  estimator.  For  model  selection  it  uses  a  version  of  BIC 
that  is  slightly  modified  by  replacing  the  maximized  likelihood  by  the  likelihood  evaluated  at 
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without  prior 


with  prior 


Figure  13:  LEFT:  A  Projection  of  the  Trees  Data  Showing  the  Model  and  Classification  Cor¬ 
responding  to  the  Five-Component  Unconstrained  Model  (highest  BIC),  and  to  the  7-conrponent 
constant  volume  and  shape  model  without  the  prior  (second  highest  BIC).  RIGHT:  a  projection  of 
the  data  showing  the  model  and  classification  corresponding  to  the  two-conrponent  constant  volume 
and  shape  model  with  the  prior  (highest  BIC).  Compare  with  Figure  9,  which  shows  the  choice 
from  a  restricted  set  of  six  models  using  an  explicit  prior. 

the  MAP.  In  application  to  a  range  of  datasets,  the  method  did  eliminated  singularities  and 
degeneracies  observed  in  maximum  likelihood  methods,  while  having  little  effect  on  stable 
results. 

In  model-based  clustering,  parameterization  through  eigenvalue  decomposition  with  ex¬ 
plicit  eigenvector  normalization  (8)  allows  cross-component  constraints  on  the  geometry 
(shape,  volume  and  orientation)  of  normal  mixture  components,  and  constitutes  a  form  of 
regularization  that  has  been  exploited  both  in  clustering  (Banfield  and  Raftery  1993;  Celeux 
and  Govaert  1995;  Fraley  and  Raftery  1998,  2002)  and  in  discriminant  analysis  (Flury  1988; 
Bensmail  and  Celeux  1996;  Fraley  and  Raftery  2002).  With  this  scheme  models  can,  for 
example,  be  constrained  to  be  either  spherical  or  diagonal  and  have  either  fixed  or  vary- 
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number  of  components 

Figure  14:  BIC  plot  for  mixture  estimation  with  the  prior  for  the  trees  data,  with  values  for 
models  that  include  vanishing  components  shown.  Refer  to  Figure  1  for  the  correspondence  between 
symbols  and  models. 

ing  covariance  across  components;  there  are  a  number  of  intermediate  possibilities  as  well. 
Failures  due  to  singularity  are  less  likely  to  occur  with  models  having  less  variation  across 
components,  although  with  a  corresponding  loss  of  modeling  flexibility. 

In  this  paper  we  have  limited  our  treatment  to  cases  in  which  there  are  more  observa¬ 
tions  than  variables.  However,  there  are  many  instances,  such  as  gene  expression  data  (e.g. 
Quackenbush  2001),  where  the  reverse  is  true.  While  the  methods  given  here  will  not  fail 
due  to  singular  covariance  estimates  when  the  prior  on  the  covariance  is  nonsingular  (for 
example,  diag(varjylata) )  cou|(j  usecj  if  n  <  ^  more  sophisticated  techniques  are  usually 
needed  to  obtain  useful  clustering.  One  approach  is  to  adopt  a  mixture  of  factor  analyz¬ 
ers  model  (Ghahramani  and  Hinton  1997;  McLachlan  et  al.  2003),  in  which  the  covariance 
matrix  has  the  form 

Yjk  —  Dk  +  , 

where  Df;  is  diagonal  and  Bk  is  a  d  x  m  matrix  of  factor  loadings,  with  m  -C  d.  Some 
regularization  is  still  required  to  prevent  the  elements  of  Dk  from  vanishing  during  estimation; 
this  could  be  done,  for  example,  by  imposing  a  common  value  of  Dk  across  components. 
Another  approach  to  mixture  modeling  of  high- dimensional  data  is  variable  selection, 
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which  arises  as  an  important  consideration  in  many  types  of  modeling,  even  when  there  are 
more  observations  than  attributes  or  variables.  Usually  variable  selection  is  accomplished 
by  a  separate  (often  heuristic)  procedure  prior  to  analysis.  Raftery  and  Dean  (2005)  devel¬ 
oped  a  model-based  clustering  that  incorporates  variable  selection  as  an  integral  part  of  the 
procedure.  It  would  be  possible  to  modify  their  method  by  including  a  prior  distribution  as 
we  have  done  here. 

The  presence  of  noise  in  the  data  can  have  a  significant  effect  on  density  estimation 
and  clustering.  Spurious  singularities  can  be  introduced  clue  to  noise  when  fitting  Gaussian 
mixtures.  In  mixture  modeling,  one  approach  is  to  add  a  term  with  a  first-order  Poisson 
distribution  to  the  mixture  model  to  account  for  noise  (Banfield  and  Raftery  1993;  Dasgupta 
and  Raftery  1998);  our  Bayesian  regularization  approach  could  also  be  applied  to  these 
models.  An  alternative  is  to  model  with  mixtures  of  t  distributions,  which  have  broader  tails 
than  normals  (McLachlan  and  Peel  1998;  McLachlan  et  al.  1999). 

Mkhadri  et  al  (1997)  reviewed  regularization  techniques  in  discriminant  analysis,  in¬ 
cluding  parameterization  of  normal  mixture  models  by  eigenvalue  decomposition  (8)  and 
Bayesian  estimation  using  conjugate  priors  analogous  to  those  we  have  used  here  in  the  case 
of  unconstrained  covariance.  These  methods  were  extended  to  cases  with  mixed  discrete  and 
continuous  variables  by  Merbouha  and  Mkhadri  (2004). 

Several  studies  have  used  the  EM  algorithm  to  extimate  the  posterior  mode  in  a  Bayesian 
approach  for  mixture  models.  Roberts  et  al.  (1998)  used  a  Dirichlet  prior  on  the  mixing 
proportions  and  a  noninformative  prior  on  the  elements  of  the  means  and  covariances,  while 
Figueiredo  and  Jain  (2002)  used  noninformative  priors  on  all  of  the  parameters  to  be  es¬ 
timated.  Brand  (1999)  proposed  an  entropic  prior  on  the  mixing  proportions,  and  applied 
his  method  to  hidden  Markov  models  as  well  as  Gaussian  mixtures.  These  methods  work 
by  starting  with  more  components  than  necessary,  and  then  pruning  those  for  which  mixing 
proportions  are  considered  negligible. 

Bayesian  estimation  for  mixture  models  can  be  done  via  Markov  chain  Monte  Carlo 
simulation,  using  priors  on  the  means  and  covariances  similar  to  the  ones  we  have  used  here 
(e.g.  Lavine  and  West  1992,  Diebolt  and  Robert  1994,  Crawford  1994,  Bensmail  et  al.  1997, 
Richardson  and  Green  1997,  Dcllaportas  1998,  Bensmail  and  Meulman  2003,  Zhang  et  al. 
2004,  Bensmail  et  al.  2005).  We  have  shown  that  Bayesian  estimation  using  the  posterior 
mode  from  EM  is  straightforward  for  many  models,  and  that  these  results  can  be  used  for 
approximate  Bayesian  estimation  in  other  models  (Section  5.5).  Thus,  for  many  applications 
it  is  not  clear  that  the  use  of  MCMC  for  mixture  estimation  and  model-based  clustering  is 
needed  given  its  computational  demands. 
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A  Algebraic  Relations 


£"=i (yj  ~  v){yj 


noting  that 

The  term  E”=i  (Vj 


utTj  1v  =  trace  =  trace  (E  1vuT^j  (9) 

h)T  =  E"=i {VjVj  ~  wj -VjVT  +  wT) 

=  E"=i  Vj yj  ~  n/iyT  -  nyyT  +  n/i/jT 
=  E"=i  VjVj  ~  nyyT  +  nyyT  -  nyyT  -  nyyT  +  n/i/jT 
=  E"=i  VjyJ  ~  nyyT  +  n(y  -  y)(y  -  y)T 
=  E"=1  j/jj/J*  -  E"=i  yyj  -  E"=i  yjyT  +  E"=i  yyT  +  n(y  -  v)  (y  -  y)T 
=  E”=1  (yj  -  y)(yj  -  y)T  +  n(y  -y){y-  h)T, 

=  w(y  -y)(y-  y)T  +  E”=i (yj  -  y){yj  -  V)T, 

(10) 

n  n  n 

nyf  =  EyyJ  =  Ey>yT  =  E^T 

1=1  1=1  1=1 

—  y)(yj  —  y)T  is  the  sums  of  squares  and  products  matrix. 
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M  y  -  -  /v)T  +  n(y  -y)(y~  y)T 

=  nviiyT  —  —  K-p/d-p/j1^  +  Kpfipfip  +  nyyr  —  n/iy  —  nyyT  +  nyyT 

=  (kp  +  n)yyT  —  nyyT  —  nv/j,yp  —  nyyiT  —  nvypyT  +  Kp/ipfip  +  nyyT 
=  (/%,  +  n)nn  —  nyyT  —  nvyiip  —  nyyT  —  KvyvyT  +  Kp/^p/ip  +  nyyT 


=  (kp  +  n)fiyT  —  n(ny  +  hiryv)T  —  ( ny  +  nvyv)iiT  + 

1 


Kpfipfip  +  nyy 


=  { Up  +  n)  n  - 


(kv  +  n) 
ny  Ki-pjjj-p 


(< kv  +  n) 
(ny  +  KvHp)(ny  +  KP/ip)T 


(ny  +  KvHp)(ny  +  Kp/ip)1 


kv  +  n 


y  - 


ny  +  hipfip 


T 


kv  +  n 


Kp/ip/ip  +  nyyT 


( n2yyT  +  nvniivyl  +  nnvyHp  +  hip/ip/ip) 


-T 


-  T 


(kv  +  n) 

=  (kp  +  n)(y  -  jxv)(y  -  yv)T  +  k 


(hip  n)  rp  (hip  ~l~  /r  j _ ^ 


(kv  +  n) 


n 


;yy  - 


-T 


ns, 


(kv  +  n)  (kp  +  n) 

=  (nv  +  n)(/i  -  jxv)(y  -  fip)T 


yvy  - 


(kv  +  n) 


“I-  n 


Wv 


(k,p  +  n) 


yy 


n 


(kv  +  n) 


v  T 

fj/-p  fj/sp 


nnv  rp 
-yvyr  + 


Kvn 


(nv  +  n) 


(up  +  n) 


yy 


Kj-pn  _rp 

- ; — yvl) 

Up  +  n 


UKp  _  rp 

- ; — Wv 

kv  +  n 


=  (hip  +  n)(n  -  p,r)(n  -  yv)T  + 


KpTl 


(hip  +  n) 


(y  -  yv)(y  -  yv? 


where 


yv  — 


n  \ 
Up  +  n) 


y  + 


(  Kp  \ 

\nv  +  n) 


(11) 
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By  analogy  to  the  derivation  of  (10), 

E"=  1  zjk(yj  -  fik)(yj  -  n k)T  =  E"=i  ZjkiyjyJ  -  HkyJ  -  y^l  +  nknl) 

EJ=i  zjkyjyj  nknkyk  kikykyk  ~t~  'kikykyk 
=  E"=i  ZjkVjyJ  -  nkykVl  +  nkykyl  ~  nkHkVk  ~  nkVkljJ  +  nkyk^l 
=  E"=i  ZjkVjyJ  -  nkVkVk  +  nk(yk  -  i^k)(yk  -  i^k)T 

=  E"=i  -  E”=i  -  E”=i  ZjkyjyJ  +  E”=i  Zjkykyl  +  nk(yk  -  Hk)(yk  -  Hk)T 
=  E"=i  zjk(yj  -  yk){yj  -  Vk)T  +  nk(yk  -  Hk)(yk  -  y>k)T, 

=  nk(yk  -  lik)(yk  -  l^k)T,  +  E”=1  Zjk{yj  -  s7fc)(s/j  ~  2/fc)T 


where  2^  is  the  probability  that  the  jth  observation  belongs  to  the  fcth  component, 


zJk  and  Vk  =  ~Y  zikyj, 

3= 1 

and  we  have  used  the  relation 


nfc  t=i 


nkyy  =  Y  ZjkVkVj  =  Y  Zjkyjy1  =  Y  zjkyy 

3= l  i=l  J=l 

By  analogy  to  the  derivation  of  (11),  we  also  have 

Kv(/kk  ~  Hr){Hk  ~  H v)T  +  nk(yk  -  yk){ Hk  -  Vk)T 


(ftp  klk)  (  [3 k 
kl-pTlk 


kl'kyk  "0  H-p/3-p 


+ 


kv  +  nk) 


kv  +  nk 
(yk  -  yv){yk  ~  Hv)T 


yk 


kl'kyk  ~t~  kt-pf^V 
kv  +  nk 


(12) 


(13) 


B  The  Univariate  Normal  Distribution  and  Normal  In¬ 
verted  Gamma  Conjugate  Prior 

The  likelihood  for  the  univariate  normal  (n  observations)  is 


=  n 


3= 1 


27r<T2 


exp 


2cr2 


(%'  - 


(— !— 

\27r(T2 


exp 


2a2 


Yivj-vY 


3= 1 


(14) 


/  1 


\27T<72 


exp 


i  d  \ ' 

■n  +  T7  L  Vi 


la2  a2 


j= 1 


2^ 

J=1 
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A  conjugate3  prior  that  is  commonly  used  for  the  univariate  normal  is  the  normal  inverse 
gamma  prior,  in  which  the  prior  for  /j  conditional  on  a2  is  normal: 


h  I  o'2  ~  A f{nT,  cr2 / kv)  oc  exp  |-^  (/i  -  (iv) 

and  the  prior  for  a2  is  inverse  gamma: 

a2  ~  inverseGamma(z/p/2,  <^2/2)  oc  (cr2) 


_ 


exp 


s-p 


2cr2 


The  hyperparameters  fj,r,  kv,  uv,  and  q2  are  called  the  mean,  shrinkage,  degrees  of  freedom 
and  scale,  respectively,  of  the  prior  distribution.  The  prior  is  therefore  proportional  to 


1 


—  exp 
cr2 


Ki 


(/i  -  HV)2  \  X  (cr2)  ^  exp 


G) 


^-p+3 

2 


exp 


2cr2 

{-£  K  +  M/z  -  hv)2]} 


2a2 


^-p+3 
1  \  2 


cr^ 


J  .  h  1  I  2  ,  2 

6XP  1  _2^p  +  ^KvfIv  ~  2^  +  KrlLp 


Combining  (14)  and  (15),  the  posterior  is  proportional  to 

iz-p+n+3  ^  ^ 

\gl  +  Kv(n~  nv)2  +  YsiVj  -  hf 


1  \  2“ 


crz 


exp 


2cr2 


1= 1 


and  the  log  posterior  (leaving  out  the  constant  terms)  is: 


vv  +  n  +  3 


log  a2  - 


2cr2 


(h  _  hv)  +  —  h)~ 

3= 1 


From  the  one  dimensional  case  of  (10) 


_  a)2  =  n(y  -  /i)2  +  -  2/)2 

t=i  l=i 


and  the  one  dimensional  case  of  (11) 


«p(A*  -  Ab>)2  +  n(v  ~  h)2  =  («P  +  n)(/z  -  /A>)2  +  -^—{y  -  /v)2, 

Kp  +  n 


(15) 


(16) 


where 

^  ny  ec-p/a-p 

hv  =  ;  i 

kp  +  n 

3 A  conjugate  prior  is  one  in  which  the  posterior  has  the  same  form  as  the  prior,  but  with  different 
hyperparameters. 
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the  complete  data  posterior  can  be  expressed  as  the  product  of  a  normal  distribution  and 
an  inverse  gamma  distribution: 


iAp+n+3 

1  \ 


er- 


exp 


kv  -\-  n  .  ~  \2  i 

(/x-/iP)  >  exp 


2a2 


2a'2 


£p  + 


Kvn 
kp  +  n 


(y  -  yvf  +  Efe  -  yf 


3= i 


The  posterior  mode  therefore  corresponds  to 


and 


d2  = 


A  _  _  _  ny  +  Kv[iP 

/ 1  =  AG  =  ;  i 

kv  +  n 

4  +  pgtfe  -  M2  +  -  5)2 


+  n  +  3 


(17) 


(18) 


B.l  Univariate  Normal  Mixtures 

The  univariate  normal  mixture  likelihood  or  density  (n  observations,  G  components)  has  the 
form 


71  G 

C(Y  I  r,  /x,  tr2)  =  n  E  rk(t>{yj  I  yk,  cr2k),  (19) 

j=  1  k=  1 

where  0  is  as  defined  in  (14). 

The  M-step  of  the  EM  algorithm  for  maximum  loglikelihood  is  the  maximization  of 

G  n 

EE  zjk  {logTfc  +  log <f>(yj  I  yk,crl)}  ,  (20) 

k=lj=l 

where  z]k  is  the  probability  (computed  in  the  E-step)  that  the  jth  observation  belongs  to 
the  A: th  component.  The  M-step  for  posterior  modes  is  the  maximization  of  the  sum  of  (20) 
and  the  log  prior  (Dempster,  Laird  and  Rubin  1977). 


B.1.1  Unequal  Variance 

Assuming  that  /ip,  kv,  vv  and  y2  are  the  same  for  each  component,  the  normal  inverted 
gamma  prior  for  a  univariate  normal  mixture  with  unequal  variance  is  proportional  to: 


nbi) 


^•p+3 

2 


exp 


k= 1 


2  al 


exp 


2  a2 


(yu  yv) 


n  (4 


^-p+3 

2 


exp 


k= 1 


y|  +  Kv(fj,k  -  yvy 

2  al 


(21) 
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The  M-step  objective  for  maximum  loglikclihood  (20)  can  be  expanded  as  follows: 


G  n 


EE  zjk  { log  Tk  +  log  |//fc,<7fc)} 


k=lj=l 

£  I  ^  ^  k«(2»)  i  ^  ,  2  i  ^  (»  - 

=  X  1  X  ->  log  "1  -  2_  ’jk - - - W  2^  2jt  log  ak  -  -  ^  Zjk - -y 


O  z _ /  o  Ae  O 

k= 1  l  i=i  J=1  "  z  i=i  z  i=i 


E  j  ,  nk  log(2vr)  nk  2 

=  2^  <  nk  log  Tfc - - - —  iogo-fc 


fe=i 


2^2  i=1 


^2zjk(Vj  l^k) 


n  log(27r) 


G 


+  {  nk  log  Tk  -  ^  log  o\  -  -E  E  z3k{yj  -  l-lk) 


k= 1 


2tr 


k  7=1 


(22) 


where  nk  =  Y3j=\  Zjk  and  Erti  to.  =  n. 

The  function  to  be  maxmized  in  the  M-step  to  estimate  the  posterior  mode  is  the  sum  of 
the  log  posterior  (21)  and  equation  (22),  and  can  be  expressed  as: 


E  Vv  +  3 

h  2 


1  2  "f"  ^7 o(/ifc  /ip) 

log  Vk - 


2^2 


G  |  ^  ^  n 

+  E  i  nk  log  rfc  -  E  !°g  -  7TT  E  z3k(yj  ~  Vkf 

k= l  (  z  Z(Tfc  j=i 

f  ,  ^p  +  3  +  Tlk  2 

=  E  W  loS  Tk - o - log  °k 

k= 1  L  2 


T  inp(/ifc  /i p )  T  ^  )  Zjk(]Jj  f^k) 

3= 1 


2°i 


(23) 


Using  (12)  and  (13),  the  M-step  objective  (23)  can  be  written  as: 


v=  f  .  z/p  T  3  +  nk  2  ^p  T  kik 

E  1  nk  log  Tk - 2 -  0g  - 1  ^ 

fc=l  l 


2  Kp'iifc 

<b>  + 


TkkUk  T  Kp/ip 


1  \  2 


ftp  +  Ufc 


2*2 


„  ,  \  (Vk  ~  ^v)2  +  E  Zjk(Vj  ~  Vkf 

1%-p  -f-  Ufa)  j  =  i 


(24) 


The  M-step  estimators  for  the  mean  and  variance  are  then 


~  kUk  T  ftp/ip 

/ifc  —  : 

ftp  +  nk 


and 


= 


4  +  (EfeE  -  /E2  +  gU  m-E  - 

ftp  +  nk  +  3 


40 


B.1.2  Equal  Variance 


Assuming  that  /j,v,  kVi  vv  and  <^2  are  the  same  for  each  component,  the  normal  inverted 
gamma  prior  for  a  univariate  normal  mixture  with  equal  variance  (<r|  =  a2)  is  proportional 
to: 

“"P+2  (  r-2\G  1  ( 

Iv-p 


_ v-p 


exp 


r2  'I  G 
s-p 


2a2 


nb2) 


exp 


fc=i 


2  al 


h-p) 


b2) 


v"p-\-G-\-2 
2 


exp 


+  Efc=iM/bfe  -  flvf 

2a2 


(25) 


The  M-step  objective  for  maximum  loglikclihood  (20)  has  the  following  expansion  under  the 
equal  variance  (<r2  =  a2)  assumption: 


G  n 


EE  zjk  {logTfc  +  \0g4>{Vj  |^,0-2)} 


k= 1  j= 1 


^  V-  l  log(27T(T2)  1  ^  (yj  -  /ifc  2 

=  E  i  E  z3k  log  Tfc  -  2^  -ifc - « - 9  E  U* - -2 - 

k=l  (i=l  3= 1  J=1 


E  J  ,  n/,.log(27T(T2' 

=  2^  <  nk  log  Tk 


(26) 


fe=l 


^  )  Zikilli  fJ>k) 


n  log(27 r<r2 


G 


Y  {  nk  log  rk  -  —  Y  Zjkiyj  -  Vk)2  )  , 


k=  1 


where  nk  =  E”=i  U'fc  and  Efc=i  nk  =  n. 

The  function  to  be  maximized  in  the  M-step  is  the  sum  of  the  log  posterior  (25)  and  on  (26), 
and  can  be  expressed  as: 

?p  +  Efc=i  M/be-/v)2 


k'-p  +  G  +  2 


log  cr  - 


2<r2 


G 


-  -  log  U2  +  E  j  log  rk-7T^Y  Zjk{yj  -  /be) 


fc=l 


2*2fe' 


G 


=  ^  nk  log  rk 


Z/p  77/  H~  G  2 


(27) 


iogff2 


fc=l 


?2  +  Efc=i  M/bt  -  /u)2  +  ELi  ^fc(//j  -  /be)2 


2cr2 

Using  (12)  and  (13),  the  M-step  objective  (23)  for  estimating  the  posterior  mode  then  be- 
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comes: 


2b  .  v-p  +  n  +  G  +  2  2  ^-p  T  nk 

2^  rafc  log  Tk - - - log  a - l  /ifc  - 

fc=i  2  Z(Ti 


'^kUk  "1"  Kv{iv 

kv  +  nk 


1  \  2 


KpTlk 


(k  +  nk)^k  ^ V ^  2/fc) 


The  M-step  parameter  estimates  are 


Tl kVk  T  K-p/i-p 

ftp  +  nk 


and 


a2  = 


?P  +  Efc=l  (K^+nk)(yk  l^v)2  +  Ej=l  Zjk{lJj  Vk)2 


v-p  -\-  ti  H~  G  H“  2 


C  The  Multivariate  Normal  and  Normal  Inverted  Wishart 
Conjugate  Prior 

The  likelihood  for  the  multivariate  normal  (n  observations)  is 
n;U0(%>,£)  =  n;U|2vrS|^  exp  {-!(%•  - /if 

=  n?=i  |27tE|_2  exp  {  — |trace  [S"1^-  -//)(///- /i)T]  } 

=  |2ttEP  ^  exp  {  — |trace  [s^1  Efifj  -  ~  h)T]  } 

=  (2tt)-it  |Ep  2  exp  {-|trace  [s-1  (n(/i  -  j/)(/x  -  £)T  +  Efif;  -  j/)  (Vj  -  y)T)]  }  , 

(28) 

where  we  have  used  (9)  and  (10)  to  obtain  the  final  form. 


Wishart  distribution.  If  X  is  an  rn  x  p  data  matrix  whose  rows  are  iid  A/"(0,  Ap),  then 
the  cross-product  matrix  XT X  is  has  a  Wishart  distribution 

XTX  ~  Wishart (iip,  Ap). 

The  parameters  vv  and  Ap  are  the  degrees  of  freedom  and  scale  of  the  distribution.  For 
multivariate  normal  data  Y,  the  Wishart  distribution  is  the  sampling  distribution  of  the 
sample  covariance  matrix. 

inverted  Wishart  distribution.  If  XT X  is  Wishart,  (XTX^J  is  inverted  Wishart: 

(xTx)~1  inverse  Wishart  ( vv ,  Ap). 
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The  inverted  Wishart  density  is  proportional  to 

/  u-p-\-d-\-l  \ 

(^XTX'j  '  '  exp  |  —  -trace  (A~1XtX^  j . 

The  inverted  Wishart  distribution  reduces  to  a  scaled  inverted  chisquare  distribution  in 
the  one- dimensional  case,  which  is  equivalent  to  an  inverted  gamma  distribution  when  the 
parameters  are  of  the  form  assumed  in  the  prior  discussed  above  for  univariate  data. 


normal  inverted  Wishart  prior.  A  conjugate  prior  that  is  commonly  used  for  the  mul¬ 
tivariate  normal  is  the  normal  inverse  Wishart  prior,  in  which  the  prior  for  yU  conditional  on 
E  is  normal: 

[ l  |  E  J\f (yU-p ,  E / 


oc  |  E  |  2  exp 


j  — -^trace  (/i  —  /iP)TS  1  (ji  —  fiv)  j 


and  the  prior  for  E  is  inverse  Wishart: 


inverseWishart  ( vv ,  AT. 


vv+d+ 1  f  1  T  1  ill 

oc  |  E  |  2  exp  |  — -trace  E  Ap  j. 


The  prior  density  is  therefore  proportional  to 
if  (^JL)  exp  j  —  ^ trace  (A^E-1)  jexpj-^ 


exp  | — f(,-^E  l(n~nv) 


(  2  )  exp  |  —  -trace  (E  1AT,1^|exp| — ^trace  E  1(/i  —  —  /iv)T  | 


Unconstrained  Covariance.  Combining  (28)  and  (31),  the  posterior  is  proportional  to 
| E |  (  2  )  exp  |  — -trace  ^E~1A“1^  |  exp  | — ^-trace  E —  fiv)(n  —  nv)T  j 

|Ef 2  exp  j-itrace  ^E"1  n(/z  -  y){n  ~y)T  +  (y^*-  y)(yj  -  y)T  jj 


u-p-\-n-\-d-\-2 

2 


exp  <  --trace  I  E  1  A^1  +  XX'/j  -  ~ 


-itrace(E  1  \kv(h  -  /v)(yU  -  nv)T  +  n{n  -  y)(n  -  y)T])  | 
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It  follows  from  (32)  and  (11)  that  the  complete  data  posterior  can  be  expressed  as  the  product 
of  an  inverted  Wishart  and  a  normal  distribution: 


K-p 

—trace 


S  (/i  /i:P)(/i  fip) 


(33) 


where 


kv  =  kv  +  n  vv  =  vv  +  n 

/v  =  (— ^— )  y  +  (— w— )  ^ 

\/Vp+n/  V/Vp  +  n/ 

A;1  =  A”1  +  (  )  (y  -  /v)(y  -  /l>)t  +  itivj  ~  V)(yj  -  v)T • 

\  rC-p  i  #6/  j  =  Y 


(34) 

(35) 


Like  the  prior,  the  complete-date  posterior  is  also  the  product  of  an  inverted  Wishart  and  a 
normal  distribution,  but  with  different  hyperparameters: 

H  |  E,  Y  ~  A /"(/ip,  k^E) 

E  |  y  ~  inverse  Wishart  ( dp ,  Ap ) , 

The  posterior  mode  corresponds  to: 

A  =  £p  =  ( — )  y  +  (— y— )  hv 

\Kp+n/  VKp  +  n/ 

^  _  K1  +  (jggp  (g -  /v)(y  -  /v)r  +  E"=i(yj  -  y)(yj  -  y)r 

hp  +  d  +  2 


Spherical  Covariance.  In  the  special  case  where  the  variance  E  in  (28)  and  (31)  is  as¬ 
sumed  to  be  spherical,  that  is,  of  the  form  a2/,  we  assume  an  inverse  gamma  distribution 
for  a2,  as  in  the  univariate  case,  giving  the  following  prior: 

2  /~i  n  /  2\  ~  iuv/2+l) 

a  ~  inverseGamma(z/p/2, Cp/2)  oc  (er  )  exp 


and 


h  I  o'2  ~  A/"(/ip,  (a2/ ftp)/)  oc  exp  {“^(a  -  hv)T(h  -  hv)|  • 

The  prior  density  is  proportional  to 

_ V'p+d+2  (  2  'j  (  K  \ 

(<T2)  “  exp(-^j)exp|-^-Ij(f1-fls)T(;i-^)J 


V'P 


(36) 


exp 


2a2  L 


?p  +  Kp(/X  -  |Up)T(,U  -  (Up) 
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and  the  posterior  is  proportional  to: 


b2) 


v-p-\-d-\-  2 
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exp 
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b2) 


nd 

f 

1 

fiT2  exp 

1  2a2 

+nd-|-d-|-2 

2 

f 

(m  /^p)J 

0  -  y)T(h  -  £)]  +  ZXs/j  -  £)r(%  -  v) 


3= 1 


exp 

1 


2cr2 


(/i  jlv)  (/i  /ip)|’ 


exp 


2cr2 


p  +  KvU  ( y  -  dv)T{y  -  M  +  itivi  ~  y)T(yj  ~  y) 


kv  +  n 


3= 1 


with  kp  and  /iP  are  as  defined  above  in  (34)  and  (35). 
The  parameter  values  at  the  posterior  mode  are: 


fJj  —  fJj-p  — 


n  \  /  ftp 

y  +  (  - - —  )  dv 


and 


kv  +  nj  *  '  \kv  +  n, 

&2  _  4  +  y  -  dv)T(y  -  dv)  +  X"=1 (yj  -  y)T(yj  -  y) 


v-p  nd  d  -|-  2 


(37) 


Diagonal  Covariance.  In  the  special  case  where  the  variance  E  in  (28)  and  (31)  is  as¬ 
sumed  to  be  spherical,  that  is,  of  the  form  diag(Sf),  i  =  1  we  assume  an  inverse 

gamma  distribution  for  each  diagonal  element  8f  of  the  covariance,  as  in  the  univariate  case, 
giving  the  following  prior: 

52  ~  inverseGamma(zvp/2,  <^/2)  oc  (<52)  ^  v ^  ^  exp 

and 

H  |  E  ~  N(yv,  (E / kv)I)  oc 


The  prior  density  is  therefore  proportional  to 
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sp 
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exp  --trace  |dia<?((5j  )  <c2/  f  exp  — —trace  |dia<?((^  )(/x  —  dv)(d  —  dv) 

(38) 
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and  the  posterior  is  proportional  to: 


if 

i= 1 


—  (z/p+2) 
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JJ  8i  (Uv+n+‘2'1  exp  --trace  ^ diag{8i  2) 
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3=1 
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exp  {  -  -trace 
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(39 

with  kr  and  fiv  are  as  defined  above  in  (34)  and  (35).  The  parameter  values  at  the  posterior 
mode  are: 

K/-p 


f~l  —  fdp  — 


n 

kv  +  n 


y  + 


Up  +  n 


dv 


and 


5. 2  = 


diag  +/  +  JgL(; y  -  yv)(y  -  yv)T  +  E"=i(2/j  -  !/)(%'  -  y)T 

V-p  +71  +  2 


-\  l2 


Y  +  ++  [(y  ~  /Ftl  +  E"=i  [(z/j  ~  y)i] 

Z+  +71  +  2 


d  +  e  • 


-  dv)(y  -  dv)T  +  ELi(%-  -  j/)(j/j  -  y)T 


K-p+n 


Up +  n 

where  e*  is  the  ith  column  of  the  identity  matrix. 


C.l  Multivariate  Normal  Mixtures 

For  multivariate  normal  mixtures  the  likelihood  or  density  has  the  form 

C{Y\t,h,H)  =  n"=1Efc=iTfe  <j>(yj  |  fik,^k),  (40) 

where  0  is  defined  as  in  (28). 

Analogous  to  the  univariate  case,  the  M-step  of  the  EM  algorithm  for  maximum  loglikelihood 
is  the  maximization  of 

G  n 

EE  zjk  {log  Tfc  +  log  (piy-j  I  yk,  al) }  (41) 

k=lj=l 
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where  z]k  is  the  probability  (computed  in  the  E-step)  that  the  jth  observation  belongs  to 
the  A; th  component,  and  the  M-step  for  posterior  modes  is  the  maximization  of  the  sum  of 
(20)  and  the  log  prior  (Dempster,  Laird  and  Rubin  1977). 


C.1.1  Unconstrained  Ellipsoidal  Covariance 


Assuming  that  fiv,  kv,  vv  and  Av  are  the  same  for  each  component,  the  normal  inverted 
Wishart  prior  for  a  multivariate  normal  mixture  with  unconstrained  covariance  across  com¬ 
ponents  is  proportional  to: 


G  f  v-p  -\-d-\-2  \  s  1 

II  lEfcl  '  2  '  exP  j -xtrace  (s^A^1) 


fc=i 


exp  \  -^trace  |Et  ‘(p*  -  ^v)Gk  -  M»)T 


(42) 


The  M-step  objective  for  maximum  loglikclihood  (41)  can  be  expanded  as  follows: 

G  n 

EE  Zjk  {logTfc  +  log  cj)(y 3 1 Hk,  E k)} 

k=lj=l 

G  (  n  n  d  log(2vr)  1 
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-  x  E  zJk  log  lsfcl  -  x  E  -ifctrace  Efc  ^  H k){Vj 
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nd  log(27r) 
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+  E  l  nk  log  Tk  -  ^  log  |Efe|  —  ^ trace  E*.1  ^  zjk(Vj  ~  dk)(Vj  ~  Hkf 


3= 1 
n 


k= 1 


3= 1 


(43) 

where  nk  =  YJj=\  zjk  and  J2k=i  nk  —  n-  Adding  the  log  prior  and  eliminating  constant  terms, 
the  function  to  be  maximized  in  the  M-step  is: 

u-p  +  d  -f-  2 


E- 


fc=i 


^  log  |Efc|  -  Erace  (s^A^1)  -  y trace  [e*.1^  -  /v)(/hk  ~  dv)T]  j 


G 
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k= 1 
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log  |E* 


3=1 

1 

-trace 
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(E^A;1 


G 


k= 1 


+  E  j  — ^ trace  (s^1^  -  nv)(nk  ~  dv)T)  ~  -trace  (  Efcx  ^  zjk{Vj  ~  dk)(Vj  ~  Hkf 

(44) 


3=1 


dkY 
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Combining  this  with  (12),  the  M-step  objective  can  be  written  as: 


X  {Uk  loS  Tk  ~  (Zy  +  ^  +  ~  )  loS  I 


1  G 

-V 

2  ^ 
z  k= 1 
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| trace 

d  ^ 
w 

l 

\  L 

-  J/fc)(j/i  -  Vkf 


3= 1 


(45) 


trace  (Ej1  Kp(ftt  -  pP)(pi..  -  p„)T  +  nt(<ft  -  Pt)(</i,  -  /n)T  )  . 


fc=l 


Using  (13),  the  M-step  objective  for  estimating  the  posterior  mode  can  be  expressed  as  a 
sum  of  separable  terms 


G 


X]  \  nk  log  rfc 

k=  1 


where 


z/p  +  nk  H — [-2 


)  log  |Sfc|  -  ytrace  (s*,1^*,  -  fik)( nk  ~  j^k)T)  ~  ^trace  (s^Ap1)  J  , 

(46) 


dfc  =  uv  +  nk  hk  =  kv  +  nk 
f^kUk  T  Ztp/ip 


l^k  — 


Kp  +  nk 


-l 


Avfc  =  Ap1  + 


K-p'Tlk 


(Ap  +  flk)  j=i 


The  parameter  values  corresponding  to  the  posterior  mode  are  as  follows: 

k^kUk  T  K-p/Jrp 


l-^k  l^k  — 


Kp  +  Tlk 


Si-  = 


Kl  +  (KpTnfc)  ~  /^Kgfc  ~  /uOr  +  XU  ~  y^i  -  y*YJ 

vv  +  nk  +  d  +  2 


(47) 

(48) 

(49) 


C.1.2  Equal  Ellipsoidal  Covariance 

Assuming  that  /iP,  kv,  vv  and  AP  are  the  same  for  each  component,  the  normal  inverted 
Wishart  prior  for  a  multivariate  normal  mixture  with  equal  covariance  across  components  is 
proportional  to: 

(  vv  +  d+l\  ,  1  'i  G 

|S|  V  2  /  exp  | --trace  (s^A"1)  1 

1  z  >  k= i 

(50) 


exp 


- trace 
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S  (/ifc  h"p)(h'fc  h"p) 
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The  M-step  objective  for  maximum  loglikelihood  (41)  can  be  expanded  as  follows: 

G  n 

EE  Zjk  {logTfc  +  log  (f>(yj  Ink, E) 

k= i 1= i 

=  E  (  E  zik  log  Tk  -  Y  E  zik  log  lsl  -  77  E  Zjfctrace  I  S"1^!  -  Rk)(yj  ~  Rkf 


9  9 
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rid  log(27r)  n 
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G  n 


77  log  1^1  —  —trace  E  1  Y  E  zjk(Vj  ~  Rk)(Vj  ~  Rk? 


k= 1 


fc=i i=i 


(51) 

where  nk  =  Z)”=i  Zjk  and  EE  nk  =  Adding  the  log  prior  and  eliminating  constant  terms, 
the  function  to  be  maximized  in  the  M-step  is: 


V'p  T  d  T  1 


G 


k= 1 


log  lsl  -  2trace  (S  1APX)  +  E  I  _2  log  ^  ~  trace  [E  l(yk  -  Rv){Rk  -  rv? 
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E 
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(52) 


Using  (12)  and  (13),  the  M-step  objective  for  estimating  the  posterior  mode  becomes: 
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(53) 

where  Rk  and  jxk  are  as  defined  in  (47)  and  (48).  The  parameter  values  corresponding  to  the 
posterior  mode  are  as  follows: 

kikVk  T  kivy,v 


Rk  Rk  — 


kv  +  nk 
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E  = 


K1  +  X*=l  [(«7+nfc)^fc  _  _  ^)T  +  EE  *!*(%’  -  2/fc)(%-  -  VkY 


uv  +  d+  l+  n  +  G 


C.1.3  Unequal  Spherical  Covariance 

Assuming  that  fiv,  kv,  vv  and  ^  are  the  same  for  each  component,  the  normal  inverted 
Wishart  prior  for  a  multivariate  normal  mixture  with  unequal  spherical  covariance  across 
components  is  proportional  to: 


ft  (d) 


i-'-p  +^+2 


k=  1 


exp 

ft  G) 

k= 1 


c2 

Sp 


I 

i/-p+ti+2 
_  2 


exp 


exp 


dv)  (l^k  Ah’)  \ 

.  2 4  j 

<?2  +  Kyjjlk  -  dr)T(dk  -  /v) 

2<Jfc 


(54) 


The  M-step  objective  for  maximum  loglikelihood  (41)  can  be  expanded  as  follows: 
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(55) 

where  nk  =  Z)”=i  and  ELi  nk  =  n-  Adding  the  log  prior  and  eliminating  constant  terms, 
the  function  to  be  maximized  in  the  M-step  is: 
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Using  (12)  and  (13),  the  M-step  objective  for  estimating  the  posterior  mode  becomes: 


f  .  v-p  T  d  T  2  T  nkd  2  /  -  -.T/  ~  \ 
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where  Rk  and  p,k  are  as  defined  in  (47)  and  (48).  The  parameter  estimates  corresponding  to 
the  posterior  mode  are  as  follows: 
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C.1.4  Equal  Spherical  Covariance 

Assuming  that  /iP,  z/p  and  are  the  same  for  each  component,  the  normal  inverted 
Wishart  prior  for  a  multivariate  normal  mixture  with  unequal  spherical  covariance  across 
components  is  proportional  to: 
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The  M-step  objective  for  maximum  loglikclihood  (41)  can  be  expanded  as  follows: 
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where  nk  =  J2j= 1  and  X)5t=i  nk  =  '«-•  Adding  the  log  prior  and  eliminating  constant  terms, 

the  function  to  be  maximized  in  the  M-step  is: 
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Using  (12)  and  (13),  the  M-step  objective  for  estimating  the  posterior  mode  becomes: 
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(61) 

where  Ac*.  and  fik  are  as  defined  in  (47)  and  (48).  The  parameter  estimates  corresponding  to 
the  posterior  mode  are  as  follows: 
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C.1.5  Unequal  Diagonal  Covariance 

Assuming  that  /iP,  /%,,  and  are  the  same  for  each  component,  the  normal  inverted 
Wishart  prior  for  a  multivariate  normal  mixture  with  unequal  diagonal  covariance  across 
components  is  proportional  to: 
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The  M-step  objective  for  maximum  loglikelihood  (41)  can  be  expanded  as  follows: 
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where  nk  =  X)”=i  %'fe  and  Ea—i  =  n-  Adding  the  log  prior  and  eliminating  constant  terms, 
the  function  to  be  maximized  in  the  M-step  is: 
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Using  (12)  and  (13),  the  M-step  objective  for  estimating  the  posterior  mode  becomes: 
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where  Kk  and  fik  are  as  defined  in  (47)  and  (48).  The  parameter  values  corresponding  to  the 
posterior  mode  are  as  follows: 
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where  e*  is  the  ith  column  of  the  identity  matrix. 


C.1.6  Equal  Diagonal  Covariance 

Assuming  that  fiv,  kv,  vv  and  <,2  are  the  same  for  each  component,  the  normal  inverted 
Wishart  prior  for  a  multivariate  normal  mixture  with  unequal  diagonal  covariance  across 
components  is  proportional  to: 
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The  M-step  objective  for  maximum  loglikclihood  (41)  can  be  expanded  as  follows: 
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where  nk  =  J2j= 1  zjk  and  Efc=i  nk  =  n-  Adding  the  log  prior  and  eliminating  constant  terms, 
the  function  to  be  maximized  in  the  M-step  is: 
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Using  (12)  and  (13),  the  M-step  objective  for  estimating  the  posterior  mode  becomes: 
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where  and  gk  are  as  defined  in  (47)  and  (48).  The  parameter  values  corresponding  to  the 
posterior  mode  are  as  follows: 
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where  e*  is  the  ith  column  of  the  identity  matrix. 
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